Skip to content

add intra_distance_score evaluation#103

Open
omkarakaya wants to merge 13 commits intomaciejkula:masterfrom
omkarakaya:intra_distance
Open

add intra_distance_score evaluation#103
omkarakaya wants to merge 13 commits intomaciejkula:masterfrom
omkarakaya:intra_distance

Conversation

@omkarakaya
Copy link
Copy Markdown

Issue #90 (Diversification metrics for evaluation)

Intra_distance diversity is probably mostly considered metric of diversity.
Therefore I'd like to add it first.

@maciejkula
Copy link
Copy Markdown
Owner

maciejkula commented Mar 17, 2018

Thank you for this!

I think to move this forward, we'll want to do two things.

  1. Can we add some references to the metric in the docstring? Maybe some papers that use it?
  2. We'll need to make this fast. From a cursory glance at the code, I suspect it's incredibly slow.

On making this fast: as far as I understand, we want to compute the cosine distance between the columns i and j of the (sparse) interaction matrix. There are a couple of things that we can do to make this faster than the current implementation:

  1. We convert the interactions object to a sparse matrix, transpose it, and make it CSR. This way, rows represent items. Call this mat.
  2. We get the lengths of the item vectors by calling lenghts = mat.getnnz(axis=1) (I think, the axis argument always throws me).
  3. We can get the length of the intersection of i and j by doing
numerator = np.in1d(mat[i].indices, mat[j].indices, assume_unique=True).sum()
denominator = lengths[i] * lengths[j]
distance = numerator / denominator

This way we don't need the cache either.

  1. Let's return this in a (num_users, k * (k-1) / 2) array (it's a list of lists right now).

We can make it even faster by using the fact that indices are sorted and using numba, but this is probably a good first step.

Copy link
Copy Markdown
Owner

@maciejkula maciejkula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, I made some additional comments.

More generally:

  1. Can you give me a sense of how slow/fast this is? Could you post how long it takes for the Movielens 100k dataset, for instance?
  2. We will need a version of this for sequence-based models. Have a look at the MRR routines to see how the two differ.

Comment thread spotlight/evaluation.py Outdated

distances = []
test = test.tocsr()
mat = train.tocoo().T.tocsr()
Copy link
Copy Markdown
Owner

@maciejkula maciejkula Apr 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the coo conversion necessary?

Comment thread spotlight/evaluation.py Outdated
"""

distances = []
test = test.tocsr()
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we need test for? For knowing which users to compute the predictions for?

Maybe a cleaner way of doing the same would be to allow the user to pass in an optional array of user ids for which the metric should be computed.

Comment thread spotlight/evaluation.py Outdated
for user_id, row in enumerate(test):
predictions = -model.predict(user_id)
rec_list = predictions.argsort()[:k]
distance = [
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find nested list comprehensions very confusing. Could we use nested for loops here?

@omkarakaya
Copy link
Copy Markdown
Author

Thank you very much for the comments. I agree.
I'll check and update accordingly.

In addition to these comments, I'm still looking for a way to move distance function to input parameters. The current function will be the default one since it's really fast (I'll post exact times separately)

@omkarakaya
Copy link
Copy Markdown
Author

I've fixed the review comments except for the sequence-based models solution.

calling intra_distance_score function in tests takes 6 seconds (only the function) when I run locally.

I should check how we can achieve to run this on sequence-based models;
We need to convert the sequence to items array with user_ids in order to compute the distance between items.

So we need a matrix like this to calculate the distance;
[[userId1, userId2, userId3],
[userId2, userId3]]

where each row represents an item.

Any advise or guidance would be greatly appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants